Abstract

Goodness-of-fit tests based on the Euclidean distance often outperform $\chi^2$ and other
classical tests (including the standard exact tests) by at least an order of magnitude
when the model being tested for goodness-of-fit is a discrete probability distribution
that is not close to uniform. The present article discusses numerous examples of this.
Goodness-of-fit tests based on the Euclidean metric are now practical and convenient:
although the actual values taken by the Euclidean distance and similar goodness-of-fit
statistics are seldom humanly interpretable, black-box computer programs can rapidly
calculate their precise significance.

1 Introduction

A basic task in statistics is to ascertain whether a given set of independent and identically 
distributed (i.i.d.) draws does not come from a given "model," where the model may consist 
of either a single fully specified probability distribution or a parameterized family of prob- 
ability distributions. The present paper concerns the case in which the draws are discrete 
random variables, taking values in a finite or countable set. In accordance with the standard 
terminology, we will refer to the possible values of the discrete random variables as "bins" 
("categories," "cells," and "classes" are common synonyms for "bins"). 

A natural approach to ascertaining whether the i.i.d. draws do not come from the model 
uses a root-mean-square statistic. To construct this statistic, we estimate the probability 
distribution over the bins using the given i.i.d. draws, and then measure the root-mean- 
square difference between this empirical distribution and the model distribution (see, for
example, Rao (2002); Varadhan et al. (1974), page 123; or Section 2 below). If the draws do
in fact arise from the model, then with high probability this root-mean-square is not large. 
Thus, if the root-mean-square statistic is large, then we can be confident that the draws do 
not arise from the model. 

To quantify "large" and "confident," let us denote by x the value of the root-mean-square 
for the given i.i.d. draws; let us denote by X the root-mean-square statistic constructed 
for different i.i.d. draws that definitely do in fact come from the model (if the model is 
parameterized, then we draw from the distribution corresponding to the parameter given by 
a maximum-likelihood estimate for the experimental data). The "P-value" P is then defined
to be the probability that X ≥ x (viewing X, but not x, as a random variable). Given
the P-value P, we can have 100(1 − P)% confidence that the draws do not arise from the
model.

Now, the P-values for the simple root-mean-square statistic can be different functions of
x for different model probability distributions. To avoid this seeming inconvenience
asymptotically (in the limit of large numbers of draws), K. Pearson replaced the uniformly
weighted mean in the root-mean-square with a weighted average; the weights are the
reciprocals of the model probabilities associated with the various bins. This produces the
classic $\chi^2$ statistic introduced by Pearson (1900); see, for example, formula (2) below.
However, when model probabilities can be small (relative to others in the same model
distribution), this weighted average can involve division by nearly zero. As demonstrated
below, dividing by nearly zero severely restricts the statistical power of $\chi^2$, even in
the absence of round-off errors, especially when dividing by nearly zero for each of many
bins. The problem arises whether or not every bin contains several draws (see Remark 1.1).
Press (2005) tackled similar issues.

The main thesis of the present article is that using only the classic $\chi^2$ statistic is no
longer appropriate, and that certain alternatives are far superior now that computers are
widely available. As illustrated below, the simple root-mean-square, used in conjunction with
the log-likelihood-ratio "$G^2$" goodness-of-fit statistic, is generally preferable to the
classic $\chi^2$ statistic. (The log-likelihood-ratio also involves division by nearly zero,
but tempers this somewhat by taking a logarithm.) We do not make any claim that this is the
best possible alternative. In fact, the discrete Kolmogorov-Smirnov and related statistics used
by Clauset et al. (2009) and D'Agostino and Stephens (1986) can be more powerful than the
root-mean-square in certain circumstances; in any case, the discrete Kolmogorov-Smirnov
statistic and the root-mean-square are similar in many ways, and complementary in others.
We focus on the root-mean-square largely because it is so simple and easy to understand;
for example, computing the P-values of the root-mean-square in the limit of large numbers
of draws is trivial, even when estimating continuous parameters via maximum-likelihood
methods, as discussed by Perkins et al. Furthermore, the classic $\chi^2$ statistic is
just a weighted version of the root-mean-square, facilitating their comparison. Finally,
$\chi^2$ and the root-mean-square coincide when the model distribution is uniform.

Please note that all statistical tests reported in the present paper (including those
involving the $\chi^2$ statistic) are exact; we compute P-values via Monte-Carlo simulations
providing guaranteed error bounds (see Section 3 below). In all numerical results reported
below, we generated random numbers via the C programming language procedure given on page 9
of Marsaglia (2003), implementing the recommended complementary multiply with carry.

To be sure, the problem with $\chi^2$ is neither subtle nor esoteric. For a particularly
revealing example, see Subsection 4.5 below.

Appropriate rebinning to uniformize the probabilities associated with the bins can mitigate
much of the problem with $\chi^2$. Yet rebinning is a black art that is liable to improperly
influence the result of a goodness-of-fit test. Moreover, rebinning requires careful extra work,
making $\chi^2$ less easy to use. A principal advantage of the root-mean-square is that it does
not require any rebinning; indeed, the root-mean-square is most powerful without any rebinning.



Remark 1.1. In many of our examples, there is a bin for which the expected number of 
draws is very small under the model. Please note that, although it is natural for the expected 
numbers of draws for some bins to be very small, especially when the model has many bins, 
the advantage of the root-mean-square over $\chi^2$ is substantial even when the expected number
of draws is at least five for every bin; see, for example, Subsection 5.1.1 or Subsection 5.2.4.

Remark 1.2. Goodness-of-fit tests are probably most useful in practice not for ascertaining
whether a model is correct or not, but for determining whether the discrepancy between the 
model and experiment is larger than expected random fluctuations. While models outside 
the physical sciences typically are not exactly correct, testing the validity of using a model 
for virtually any purpose requires knowing whether observed discrepancies are due to inaccu- 
racies or inadequacies in the models or (on the contrary) could be due to chance arising from 
necessarily finite sample sizes. Thus, goodness-of-fit tests are critical even when the models 
are not supposed to be exactly correct, in order to gauge the size of the unavoidable random 
fluctuations. For further clarification, see the remarkably extensive title used by Pearson
(1900) introducing $\chi^2$; see also the modern treatments by Gelman (2003) and Cox (2006).



Remark 1.3. Combining the root-mean-square methodology and the statistical bootstrap
given by Efron and Tibshirani (1993) should produce a test for whether two separate sets
of draws arise from the same or from different distributions, when each set is taken i.i.d. 
from some (unspecified) distribution; the two distributions associated with the sets may 
differ. This is related to testing for association/independence/homogeneity in contingency- 
tables/cross-tabulations that have only two rows. 



2 Definitions of the test statistics 

In this section, we review the definitions of four goodness-of-fit statistics: the
root-mean-square, $\chi^2$, the log-likelihood-ratio or $G^2$, and the Freeman-Tukey or
Hellinger distance. The latter three statistics are the best-known members of the standard
Cressie-Read power-divergence family, as discussed by Rao (2002). We use $p_0^{(1)}$,
$p_0^{(2)}$, ..., $p_0^{(m)}$ to denote the modeled fractions of n i.i.d. draws falling in
m bins, numbered 1, 2, ..., m, respectively, and we use $\hat p^{(1)}$, $\hat p^{(2)}$, ...,
$\hat p^{(m)}$ to denote the observed fractions of the n draws falling in the respective bins.
That is, $p_0^{(1)}$, $p_0^{(2)}$, ..., $p_0^{(m)}$ are the probabilities associated with the
respective bins in the model distribution, whereas $\hat p^{(1)}$, $\hat p^{(2)}$, ...,
$\hat p^{(m)}$ are the fractions of the n draws falling in the respective bins when we take
the draws from a distribution that may differ from the model (their actual distribution).
Specifically, if $i_1$, $i_2$, ..., $i_n$ are the observed i.i.d. draws, then $\hat p^{(j)}$
is $1/n$ times the number of $i_1$, $i_2$, ..., $i_n$ falling in bin j, for j = 1, 2, ..., m.
If the model is parameterized by a parameter $\theta$, then the probabilities $p_0^{(1)}$,
$p_0^{(2)}$, ..., $p_0^{(m)}$ are functions of $\theta$; if the model is fully specified, then
we can view the probabilities $p_0^{(1)}$, $p_0^{(2)}$, ..., $p_0^{(m)}$ as constant as
functions of $\theta$. We use $\hat\theta$ to denote a maximum-likelihood estimate of
$\theta$ obtained from $\hat p^{(1)}$, $\hat p^{(2)}$, ..., $\hat p^{(m)}$.

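With this notation, the root-mean-square statistic is

$$ x = \sqrt{ \frac{1}{m} \sum_{j=1}^{m} \left( \hat p^{(j)} - p_0^{(j)}(\hat\theta) \right)^2 }. \tag{1} $$

We use the designation "root-mean-square" to refer to x.

The classical Pearson $\chi^2$ statistic is

$$ \chi^2 = n \sum_{j=1}^{m} \frac{ \left( \hat p^{(j)} - p_0^{(j)}(\hat\theta) \right)^2 }{ p_0^{(j)}(\hat\theta) }, \tag{2} $$

under the convention that $(\hat p^{(j)} - p_0^{(j)}(\hat\theta))^2 / p_0^{(j)}(\hat\theta) = 0$
if $p_0^{(j)}(\hat\theta) = 0 = \hat p^{(j)}$. We use the standard designation "$\chi^2$" to
refer to $\chi^2$.

The log-likelihood-ratio or "$G^2$" statistic is

$$ g^2 = 2n \sum_{j=1}^{m} \hat p^{(j)} \ln\left( \frac{ \hat p^{(j)} }{ p_0^{(j)}(\hat\theta) } \right), \tag{3} $$

under the convention that $\hat p^{(j)} \ln(\hat p^{(j)} / p_0^{(j)}(\hat\theta)) = 0$ if
$\hat p^{(j)} = 0$. We use the common designation "$G^2$" to refer to $g^2$.

The Freeman-Tukey or Hellinger-distance statistic is

$$ h^2 = 4n \sum_{j=1}^{m} \left( \sqrt{ \hat p^{(j)} } - \sqrt{ p_0^{(j)}(\hat\theta) } \right)^2. \tag{4} $$

We use the well-known designation "Freeman-Tukey" to refer to $h^2$.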

In the limit that the number n of draws is large, the distributions of $\chi^2$ defined in (2),
$g^2$ defined in (3), and $h^2$ defined in (4) are all the same when the actual underlying
distribution of the draws comes from the model, as discussed, for example, by Rao (2002).
However, when the number n of draws is not large, then their distributions can differ
substantially. In all our data and power analyses, we compute P-values via Monte-Carlo
simulations, without relying on the number n of draws to be large.
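As a concrete illustration of definitions (1)-(4), the following is a minimal Python sketch rather than the code used for the reported results. The function names and the small data set (n = 20 draws over m = 4 equally likely bins) are assumptions made for illustration; returning infinity when the model assigns probability zero to a bin that contains draws is likewise a choice of this sketch, not a convention stated above.

```python
import math

def root_mean_square(p_hat, p0):
    """Statistic (1): root-mean-square difference between p_hat and p0."""
    m = len(p0)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_hat, p0)) / m)

def pearson_chi2(p_hat, p0, n):
    """Statistic (2); a bin with p0 = 0 = p_hat contributes 0."""
    total = 0.0
    for a, b in zip(p_hat, p0):
        if a == 0 and b == 0:
            continue
        if b == 0:
            return math.inf  # assumption of this sketch: draws in an impossible bin
        total += (a - b) ** 2 / b
    return n * total

def log_likelihood_ratio(p_hat, p0, n):
    """Statistic (3); a bin with p_hat = 0 contributes 0."""
    total = 0.0
    for a, b in zip(p_hat, p0):
        if a == 0:
            continue
        if b == 0:
            return math.inf  # assumption of this sketch, as above
        total += a * math.log(a / b)
    return 2 * n * total

def freeman_tukey(p_hat, p0, n):
    """Statistic (4): 4n times the squared Hellinger-type distance."""
    return 4 * n * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p_hat, p0))

# Hypothetical data: n = 20 draws over m = 4 equally likely bins.
n = 20
p0 = [0.25, 0.25, 0.25, 0.25]
p_hat = [15 / 20, 5 / 20, 0.0, 0.0]

x = root_mean_square(p_hat, p0)
chi2 = pearson_chi2(p_hat, p0, n)
g2 = log_likelihood_ratio(p_hat, p0, n)
h2 = freeman_tukey(p_hat, p0, n)
print(x, chi2, g2, h2)
```

For a uniform model, (1) and (2) give $\chi^2 = n m^2 x^2$, so the two statistics order data sets identically; the sketch can be used to check this numerically.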



3 Hypothesis tests with parameter estimation 

In this section, we discuss the testing of hypotheses involving parameterized models: Given a
family $p_0(\theta)$ of probability distributions parameterized by $\theta$, and given observed
i.i.d. draws from some actual underlying (unknown) distribution p, we would like to test the
hypothesis

$$ H_0' : \text{for some } \theta, \quad p = p_0(\theta), \tag{5} $$

against the alternative

$$ H_1' : \text{for all } \theta, \quad p \neq p_0(\theta). \tag{6} $$

Given only finitely many draws, the P-value for such a test would have to be independent
of the parameter $\theta$, since the proper value for $\theta$ is unknown ($\theta$ is known as
a "nuisance" parameter). Unfortunately, it is not clear how to devise such a test when the
probability distributions are discrete. None of the standard methods (including $\chi^2$, the
log-likelihood-ratio, the Freeman-Tukey/Hellinger distance, and other Cressie-Read
power-divergence statistics) produce P-values that are independent of the parameter $\theta$.
Some methods do produce P-values that are independent of $\theta$ in the limit of large numbers
of draws, but this is not especially useful, since in the limit of large numbers of draws any
actual parameter $\theta$ would be almost surely known exactly anyway; further elaboration is
available in Appendix B of Perkins et al. (2011a).



In the present paper, we test the significance of assuming

$$ H_0 : p = p_0(\hat\theta) \text{ for the particular observed value of } \hat\theta, \tag{7} $$

where $\hat\theta$ is a maximum-likelihood estimate of $\theta$; i.e., $H_0$ is the hypothesis
that $p = p_0(\hat\theta)$ for the value of $\hat\theta$ associated with the single realization
of the experiment that was measured (subsequent repetitions of the experiment, including those
considered when calculating the P-value as in Remark 3.3, can yield different estimates of the
parameter, even though the repetitions' actual distribution p is the same). Of course, the
accuracy of the estimate $\hat\theta$ generally improves as the number of draws increases;
testing (5) and testing (7) are asymptotically equivalent, in the limit of large numbers of
draws, under the conditions of Romano (1988).

As testing exactly $H_0'$ defined in (5) does not seem to be feasible in general when the
probability distributions are discrete and there are more than just a few bins, we focus
on testing the closely related assumption $H_0$ defined in (7). The latter is more relevant
for many applications anyway: plots typically display the particular fitted distribution
in (7), and interpreting such plots naturally involves (7). All tests of the present paper
concern the significance of assuming $H_0$ defined in (7) (if the model is fully specified,
then the probability distribution $p_0(\theta)$ is the same for all $\theta$). Please be sure to
bear in mind Remark 1.2 of Section 1. A significance test simply gauges the consistency of the
observed data with our assumption; we are not trying to decide whether the assumption is likely
to be true (or false), nor are we trying to decide whether some alternative assumption is likely
to be true (or false).

Remark 3.1. Another means of handling nuisance parameters is to test the hypothesis

$$ H_0'' : p = p_0(\hat\theta) \text{ for all possible realizations of the experiment;} \tag{8} $$

that is, $H_0''$ is the hypothesis that $p = p_0(\hat\theta)$ and that $p_0(\hat\theta)$ always
takes exactly the same value during repetitions of the experiment. The assumption that (8) is
true seems to be more extreme, a more substantial departure from (5), than (7). Nevertheless,
testing (8) is standard; see, for example, Section 6 of Cochran (1952). Assuming (8) amounts to
conditioning on a statistic that is minimally sufficient for estimating $\theta$; computing the
associated P-values is not always trivial. Testing the significance of assuming (7) would seem
to be more apropos in practice for applications in which the experimental design does not
enforce that repeated experiments always yield the same value for $p_0(\hat\theta)$.

Remark 3.2. The parameter $\theta$ can be integer-valued, real-valued, complex-valued,
vector-valued, matrix-valued, or any combination of the many possibilities. For instance, when
we do not know the proper ordering of the bins a priori, we must include a parameter that
contains a permutation (or permutation matrix) specifying the order of the bins;
maximum-likelihood estimation then entails sorting the model and all empirical frequencies
(whether experimental or simulated); see Subsection 4.2 for details. With Remark 3.3, we need
not contemplate how many degrees of freedom are in a permutation.

Remark 3.3. To compute the P-value assessing the consistency of the experimental data
with assuming (7), we can use Monte-Carlo simulations (very similar to those used by
Clauset et al. (2009)). First, we estimate the parameter $\theta$ from the n given experimental
draws, obtaining $\hat\theta$, and calculate the statistic ($\chi^2$, $G^2$, Freeman-Tukey, or
the root-mean-square), using the given data and taking the model distribution to be
$p_0(\hat\theta)$. We then run many simulations. To conduct a single simulation, we perform the
following three-step procedure:

1. we generate n i.i.d. draws according to the model distribution $p_0(\hat\theta)$, where
$\hat\theta$ is the estimate calculated from the experimental data,

2. we estimate the parameter $\theta$ from the data generated in Step 1, obtaining a new
estimate $\hat\Theta$, and

3. we calculate the statistic under consideration ($\chi^2$, $G^2$, Freeman-Tukey, or the
root-mean-square), using the data generated in Step 1 and taking the model distribution
to be $p_0(\hat\Theta)$, where $\hat\Theta$ is the estimate calculated in Step 2 from the data
generated in Step 1.

After conducting many such simulations, we may estimate the P-value for assuming (7) as the
fraction of the statistics calculated in Step 3 that are greater than or equal to the statistic
calculated from the empirical data. The standard error of the estimated P-value is inversely
proportional to the square root of the number of simulations conducted; for details, see
Remark 3.4 below. This procedure works since, by definition, the P-value is the probability
that



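$$ d\left( \begin{pmatrix} \hat P^{(1)} \\ \hat P^{(2)} \\ \vdots \\ \hat P^{(m)} \end{pmatrix}, \begin{pmatrix} p_0^{(1)}(\hat\Theta) \\ p_0^{(2)}(\hat\Theta) \\ \vdots \\ p_0^{(m)}(\hat\Theta) \end{pmatrix} \right) \geq d\left( \begin{pmatrix} \hat p^{(1)} \\ \hat p^{(2)} \\ \vdots \\ \hat p^{(m)} \end{pmatrix}, \begin{pmatrix} p_0^{(1)}(\hat\theta) \\ p_0^{(2)}(\hat\theta) \\ \vdots \\ p_0^{(m)}(\hat\theta) \end{pmatrix} \right), \tag{9} $$

where

• m is the number of all possible values that the draws can take,

• d is the measure of the discrepancy between two probability distributions over m bins
(i.e., between two vectors each with m entries) that is associated with the statistic
under consideration (d is the Euclidean distance for the root-mean-square, a weighted
Euclidean distance for $\chi^2$, the Hellinger distance for the Freeman-Tukey statistic, and
the relative entropy, that is, the Kullback-Leibler divergence, for the log-likelihood-ratio),

• $\hat p^{(1)}$, $\hat p^{(2)}$, ..., $\hat p^{(m)}$ are the fractions of the n given
experimental draws falling in the respective bins,

• $\hat\theta$ is the estimate of $\theta$ obtained from $\hat p^{(1)}$, $\hat p^{(2)}$, ...,
$\hat p^{(m)}$,

• $\hat P^{(1)}$, $\hat P^{(2)}$, ..., $\hat P^{(m)}$ are the fractions of n i.i.d. draws
falling in the respective bins when taking the draws from the distribution $p_0(\hat\theta)$
assumed in (7), and

• $\hat\Theta$ is the estimate of the parameter $\theta$ obtained from $\hat P^{(1)}$,
$\hat P^{(2)}$, ..., $\hat P^{(m)}$ (note that $\hat\Theta$ is not necessarily always equal to
$\hat\theta$: even under the null hypothesis, repetitions of the experiment could yield
different estimates of the parameter; see also Remark 3.5).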

When taking the probability that (9) occurs, only the left-hand side is random: we regard
the left-hand side of (9) as a random variable and the right-hand side as a fixed number
determined via the experimental data. As with any probability, to compute the probability
that (9) occurs, we can calculate many independent realizations of the random variable and
observe that the fraction which satisfy (9) is a good approximation to the probability when
the number of realizations is large; Remark 3.4 details the accuracy of the approximation.
(The procedure in the present remark follows this prescription to estimate P-values.)
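The three-step procedure of this remark can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used for the results reported here: the function and argument names are hypothetical, the caller supplies `model` (mapping a parameter to the list of bin probabilities) and `fit` (mapping empirical fractions to a maximum-likelihood estimate), and Python's built-in generator stands in for the random number generator of Marsaglia (2003).

```python
import math
import random

def root_mean_square(p_hat, p0, n):
    """Root-mean-square statistic; n is unused but keeps one call signature."""
    m = len(p0)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_hat, p0)) / m)

def monte_carlo_p_value(counts, model, fit, statistic, num_simulations, rng):
    """Estimate the P-value for assuming (7).

    counts    -- observed numbers of draws in the m bins
    model     -- hypothetical callable: parameter -> list of m probabilities
    fit       -- hypothetical callable: empirical fractions -> estimate of the parameter
    statistic -- callable (p_hat, p0, n) -> float
    """
    n = sum(counts)
    m = len(counts)
    p_hat = [c / n for c in counts]
    theta_hat = fit(p_hat)                      # estimate from the experimental data
    p0 = model(theta_hat)
    observed = statistic(p_hat, p0, n)
    hits = 0
    for _ in range(num_simulations):
        # Step 1: n i.i.d. draws from the fitted model distribution.
        sim_counts = [0] * m
        for j in rng.choices(range(m), weights=p0, k=n):
            sim_counts[j] += 1
        big_p_hat = [c / n for c in sim_counts]
        # Step 2: re-estimate the parameter from the simulated draws.
        big_theta_hat = fit(big_p_hat)
        # Step 3: statistic of the simulated data against the re-fitted model.
        if statistic(big_p_hat, model(big_theta_hat), n) >= observed:
            hits += 1
    return hits / num_simulations

# Usage with a fully specified model (no parameter to estimate).
uniform_model = lambda theta: [0.25, 0.25, 0.25, 0.25]
no_fit = lambda fractions: None
p = monte_carlo_p_value([15, 5, 0, 0], uniform_model, no_fit,
                        root_mean_square, 2000, random.Random(0))
print(p)
```

Re-fitting inside the loop (Step 2) is what distinguishes this from a plain simulation against the fitted model; for a fully specified model the two coincide.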
Remark 3.4. The standard error of the estimate from Remark 3.3 for an exact P-value P is
$\sqrt{P(1-P)/\ell}$, where $\ell$ is the number of Monte-Carlo simulations conducted to
produce the estimate. Indeed, each simulation has probability P of producing a statistic that
is greater than or equal to the statistic corresponding to an exact P-value of P. Since the
simulations are all independent, the number of the $\ell$ simulations that produce statistics
greater than or equal to that corresponding to P-value P follows the binomial distribution with
$\ell$ trials and probability P of success in each trial. The standard deviation of the number
of simulations whose statistics are greater than or equal to that corresponding to P-value P is
therefore $\sqrt{\ell P(1-P)}$, and so the standard deviation of the fraction of the
simulations producing such statistics is $\sqrt{P(1-P)/\ell}$. Of course, the fraction itself
is the Monte-Carlo estimate of the exact P-value (we use this estimate in place of the unknown
P when calculating the standard error $\sqrt{P(1-P)/\ell}$).
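The standard error in this remark is simple to evaluate. The short Python sketch below (the function name is an assumption) plugs the estimated P-value into the binomial formula; since P(1 − P) is at most 1/4, the value at P = 0.5 bounds the standard error for any P.

```python
import math

def p_value_standard_error(p_estimate, num_simulations):
    """Binomial standard error sqrt(P(1 - P) / l) of a Monte-Carlo P-value."""
    return math.sqrt(p_estimate * (1.0 - p_estimate) / num_simulations)

# Worst case P = 0.5 for the simulation counts used in Section 4.
for num_simulations in (40_000, 160_000, 4_000_000):
    print(num_simulations, p_value_standard_error(0.5, num_simulations))
```

With 40,000 simulations the worst-case standard error is about 0.0025, and with 4,000,000 simulations it is about 0.00025.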

Remark 3.5. For any family $p_0(\theta)$ of discrete probability distributions parameterized by
a permutation $\theta$ that specifies the order of the bins (meaning that there exists a
discrete probability distribution q such that $p_0^{(j)}(\theta) = q^{(\theta(j))}$ for all j),
and for any number n of draws, the P-values defined in Remark 3.3 have the following highly
desirable property: Suppose that the actual underlying distribution p of the experimental draws
is equal to $p_0(\theta)$ for some (unknown) $\theta$. Suppose further that P is the P-value for
assuming (7), calculated for a particular realization of the experiment. Consider repeating the
same experiment over and over, and calculating the P-value for each realization, each time
using that realization's particular maximum-likelihood estimate of the parameter in the
hypothesis (7). Then, the fraction of the P-values that are less than or equal to P is equal to
P in the limit of many repetitions of the experiment. This property is a compelling reason to
use $d(\hat P, p_0(\hat\Theta))$ rather than $d(\hat P, p_0(\hat\theta))$ in the left-hand side
of (9). Also, the procedure of Remark 3.3 can be viewed as a parametric bootstrap
approximation, as discussed, for example, by Romano (1988), Efron and Tibshirani (1993), and
Bickel et al. (2006).

For any family $p_0(\theta)$ of discrete probability distributions, the P-values defined in
Remark 3.3 have the following additional highly desirable property: Suppose that the actual
underlying distribution p of the experimental draws is equal to $p_0(\theta)$ for some (unknown)
$\theta$. Consider repeating the experiment over and over, and calculating the P-value for each
realization, each time using that realization's particular maximum-likelihood estimate of the
parameter in the hypothesis (7). Then, the resulting P-values converge in distribution to
the uniform distribution over (0, 1), in the limit of large numbers of draws.

It may be somewhat fortuitous that the scheme in Remark 3.3 has so many favorable
properties; indeed, Bayarri and Berger (2000) and Robins et al. (2000) (among others) have
pointed out problems with certain generalizations.



4 Data analysis 

In this section, we use several data sets to investigate the performance of goodness-of-fit
statistics. The root-mean-square generally performs much better than the classical statistics.
We take the position that a user of statistics should not have to worry about rebinning; we
discuss rebinning only briefly. We compute all P-values via Monte Carlo as in Remark 3.3;
Remark 3.4 details the guaranteed accuracy of the computed P-values.

4.1 Synthetic examples

To better explicate the performance of the goodness-of-fit statistics, we first analyze some
toy examples. We consider the model distribution

$$ p_0^{(1)} = \frac{1}{4}, \tag{10} $$

$$ p_0^{(2)} = \frac{1}{4}, \tag{11} $$

and

$$ p_0^{(j)} = \frac{1}{2m - 4} \tag{12} $$

for j = 3, 4, ..., m. For the empirical distribution, we first use n = 20 draws, with 15 in
the first bin, 5 in the second bin, and no draw in any other bin. This data is clearly unlikely
to arise from the model specified in (10)-(12), but we would like to see exactly how well the
various goodness-of-fit statistics detect the obvious discrepancy.

Figure 1 plots the P-values for testing whether the empirical data arises from the model
specified in (10)-(12). We computed the P-values via 4,000,000 Monte-Carlo simulations (i.e.,
4,000,000 per empirical P-value being evaluated), with each simulation taking n = 20 draws
from the model. The root-mean-square consistently and with extremely high confidence
rejects the hypothesis that the data arises from the model, whereas the classical statistics find
less and less evidence for rejecting the hypothesis as the number m of bins increases; in fact,
the P-values for the classical statistics get very close to 1 as m increases. The reason is that
the discrepancy of (12) from 0 is usually less than the discrepancy of (12) from a typical
realization drawn from the model, since under the model the sum of the expected numbers of
draws in bins 3, 4, ..., m is n/2.

Figure 1 demonstrates that the root-mean-square can be much more powerful than the
classical statistics, rejecting with nearly 100% confidence while the classical statistics report
nearly 0% confidence for rejection. Moreover, the classical statistics can report P-values very
close to 1 even when the data manifestly does not arise from the model. (Incidentally, the
model for smaller m can be viewed as a rebinning of the model for larger m. The classical
statistics do reject the model for smaller m, while asserting for larger m that there is no
evidence for rejecting the model.) The performance of the classical statistics displays a
dramatic dependence on the number (m − 2) of unlikely bins in the model, even though the
data are the same for all m. This suggests a sure-fire scheme for supporting any model (no
matter how invalid) with arbitrarily high P-values: just append enough irrelevant, more or
less uniformly improbable bins to the model, and then report the P-values for the classical
goodness-of-fit statistics. In contrast, the root-mean-square robustly and reliably rejects the
invalid model, independently of the size of the model.
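The first toy example can be explored with a short, self-contained Python sketch. It is only an illustration: it assumes the model probabilities 1/4, 1/4, and 1/(2m − 4) from (10)-(12), uses far fewer simulations than the 4,000,000 behind Figure 1, compares only the root-mean-square and $\chi^2$, and all function names are hypothetical.

```python
import math
import random

def model_probabilities(m):
    """Model (10)-(12): 1/4, 1/4, then 1/(2m - 4) for bins 3, ..., m (m >= 3)."""
    return [0.25, 0.25] + [1.0 / (2 * m - 4)] * (m - 2)

def root_mean_square(p_hat, p0, n):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_hat, p0)) / len(p0))

def pearson_chi2(p_hat, p0, n):
    # Every model probability is positive here, so no zero-division convention is needed.
    return n * sum((a - b) ** 2 / b for a, b in zip(p_hat, p0))

def toy_p_values(m, num_simulations, seed):
    """Monte-Carlo P-values for 15 draws in bin 1, 5 in bin 2, none elsewhere."""
    rng = random.Random(seed)
    n = 20
    p0 = model_probabilities(m)
    data = [15 / n, 5 / n] + [0.0] * (m - 2)
    statistics = {"root-mean-square": root_mean_square, "chi2": pearson_chi2}
    observed = {name: f(data, p0, n) for name, f in statistics.items()}
    hits = dict.fromkeys(statistics, 0)
    for _ in range(num_simulations):
        counts = [0] * m
        for j in rng.choices(range(m), weights=p0, k=n):
            counts[j] += 1
        simulated = [c / n for c in counts]
        for name, f in statistics.items():
            if f(simulated, p0, n) >= observed[name]:
                hits[name] += 1
    return {name: h / num_simulations for name, h in hits.items()}

for m in (4, 20, 100):
    print(m, toy_p_values(m, num_simulations=2000, seed=0))
```

The model is fully specified, so no parameter is re-estimated inside the loop; with so few simulations the printed values are coarse estimates, with standard errors as in Remark 3.4.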

We will see in the following section that the classic Zipf power law behaves similarly. 

For another example, we again consider the model specified in (10)-(12). For the empirical
distribution, we now use n = 96 draws, with 36 in the first bin, 12 in the second bin,
1 each for bins 3, 4, ..., 50, and no draw in any other bin. As before, this data clearly is
unlikely to arise from the model specified in (10)-(12), but we would like to see exactly how
well the various goodness-of-fit statistics detect the obvious discrepancy.



Figure 1: P-values for the hypothesis that the model (10)-(12) agrees with the data of 15
draws in the first bin, 5 draws in the second bin, and no draw in any other bin



Figure 2 plots the P-values for testing whether the empirical data arises from the model
specified in (10)-(12). We computed the P-values via 160,000 Monte-Carlo simulations (that
is, 160,000 per empirical P-value being evaluated), with each simulation taking n = 96 draws
from the model. Yet again, the root-mean-square consistently and confidently rejects the
hypothesis that the data arises from the model, whereas the classical statistics find little
evidence for rejecting the manifestly invalid model.



4.2 Zipf's power law of word frequencies

Zipf popularized his eponymous law by analyzing four "chief sources of statistical data
referred to in the main text" (this is a quotation from the "Notes and References" section,
page 311, of Zipf (1935)): the chief source for the English language is Eldridge (2010). We
revisit the data of Eldridge (2010) in the present subsection to assess the performance of the
goodness-of-fit statistics.



Figure 2: P-values for the hypothesis that the model (10)-(12) agrees with the data of 36
draws in the first bin, 12 draws in the second bin, 1 draw each in bins 3, 4, ..., 50, and no
draw in any other bin

We first analyze List 1 of Eldridge (2010), which consists of 2,890 different English words,
such that there are 13,825 words in total counting repetitions; the words come from the
Buffalo Sunday News of August 8, 1909. We randomly choose n = 10,000 of the 13,825
words to obtain a corpus of n = 10,000 draws over 2,890 bins. Figure 3 plots the frequencies
of the different words when sorted in rank order (so that the frequencies are nonincreasing).
Using goodness-of-fit statistics, we test the significance of the (null) hypothesis that the
empirical draws actually arise from the Zipf distribution

$$ p_0^{(j)}(\theta) = \frac{C_1}{\theta(j)} \tag{13} $$

for j = 1, 2, ..., m, where $\theta$ is a permutation of the integers 1, 2, ..., m, and

$$ C_1 = \frac{1}{\sum_{j=1}^{m} 1/j}; \tag{14} $$

we estimate the permutation $\theta$ via maximum-likelihood methods, that is, by sorting the
frequencies: first we choose $j_1$ to be the number of a bin containing the greatest number of
draws among all m bins, then we choose $j_2$ to be the number of a bin containing the greatest
number of draws among the remaining m − 1 bins, then we choose $j_3$ to be the number of a
bin containing the greatest number of draws among the remaining m − 2 bins, and so on, and
finally we find $\hat\theta$ such that $\hat\theta(j_1) = 1$, $\hat\theta(j_2) = 2$, ...,
$\hat\theta(j_m) = m$. We have to obtain the ordering $\hat\theta$ from the data via such
sorting since we do not know the proper ordering a priori.
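The sorting-based estimate of the permutation and the resulting Zipf probabilities (13)-(14) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the four-bin counts are hypothetical, bins are indexed from 0 in the code while ranks start at 1, and ties are broken by bin index, a choice the text above leaves open.

```python
def fit_permutation(counts):
    """Return theta_hat as a list: theta_hat[j] is the rank (1 = most draws) of bin j."""
    # sorted() is stable, so bins with equal counts keep their original order.
    order = sorted(range(len(counts)), key=lambda j: -counts[j])
    theta_hat = [0] * len(counts)
    for rank, j in enumerate(order, start=1):
        theta_hat[j] = rank
    return theta_hat

def zipf_probabilities(theta):
    """Zipf model (13): probability C_1 / theta(j), with C_1 given by (14)."""
    m = len(theta)
    c1 = 1.0 / sum(1.0 / j for j in range(1, m + 1))
    return [c1 / rank for rank in theta]

# Hypothetical counts for m = 4 bins.
counts = [3, 10, 0, 7]
theta_hat = fit_permutation(counts)      # ranks of the four bins
p0 = zipf_probabilities(theta_hat)
print(theta_hat, p0)
```

In a Monte-Carlo run as in Remark 3.3, the same sorting would be applied to every simulated data set before computing its statistic.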

Similarly, we do not know the proper value of the number m of bins, so in Figure 4 we
plot P-values (each computed via 40,000 Monte-Carlo simulations) for varying values of m;
although List 1 of Eldridge (2010) involves only 2,890 distinct words, we must also include
bins for words that did not appear in the original list, words whose frequencies are zeros
for List 1 of Eldridge (2010). Note that Figure 4 displays the P-values with m = 2,890 for
reference, even though m must be independent of the data, and so m must be substantially
larger than 2,890 in order for the assumptions of goodness-of-fit testing to hold.

With respect to testing goodness-of-fit, the number m of bins is the number of words in
the dictionary from which List 1 of Eldridge (2010) was drawn. It is not clear a priori which
dictionary is appropriate. Fortunately, the P-values for the root-mean-square are always 0 to
several digits of accuracy, independent of the value of m; the root-mean-square determines
that List 1 does not follow the classic Zipf distribution (defined in (13) and (14)) for any m.
In contrast, the P-values for the classical statistics vary wildly depending on the value of m.
In fact, for any of the classical statistics, and for any prescribed number P between 0.05 and
0.95, there is at least one value of m between 4,000 and 40,000 such that the P-value is P.
Thus, without knowing the proper size of the dictionary a priori, the classical statistics are
meaningless.



Unsurprisingly, analyzing List 5 of Eldridge (2010) produces results analogous to those
reported above for List 1. List 5 consists of 6,002 different English words, such that there
are 43,989 words in total counting repetitions; the words come from amalgamating Lists 1-4
of Eldridge (2010). We randomly choose n = 20,000 of the 43,989 words to obtain a corpus
of n = 20,000 draws over 6,002 bins. Figure 5 plots the frequencies of the different words
when sorted in rank order (so that the frequencies are nonincreasing).

Again we do not know the proper value of the number m of bins, so in Figure 6 we
plot P-values (each computed via 40,000 Monte-Carlo simulations) for varying values of m;
although List 5 of Eldridge (2010) involves only 6,002 distinct words, we must also include
bins for words that did not appear in the original list, words whose frequencies are zeros for
List 5 of Eldridge (2010). Please note that Figure 6 displays the P-values with m = 6,002 for
reference, even though m must be independent of the data, and so m must be substantially
larger than 6,002 in order for the assumptions of goodness-of-fit testing to hold. Comparing
Figures 4 and 6 shows that the above remarks about List 1 pertain to the analysis of the
larger List 5, too. Once again, without knowing the proper size of the dictionary a priori,
the classical statistics are meaningless, whereas the root-mean-square is very powerful.

Interestingly, by introducing parameters $\theta_1$, $\theta_2$, and $\theta_3$ to fit perfectly the bins containing
the three greatest numbers of draws, a truncated power-law becomes a good fit for the corpus
of 20,000 words drawn randomly from List 5 of Eldridge (2010), with the number m of bins
set to 7,500. Indeed, let us consider the model



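$$
p_0^{(j)}(\theta_0, \theta_1, \theta_2, \theta_3, \theta_4) =
\begin{cases}
\theta_1, & \theta_0(j) = 1 \\
\theta_2, & \theta_0(j) = 2 \\
\theta_3, & \theta_0(j) = 3 \\
C / (\theta_0(j))^{\theta_4}, & \theta_0(j) = 4, 5, \dots, 7500
\end{cases}
\tag{15}
$$

where

$$
C = C_{\theta_1, \theta_2, \theta_3, \theta_4} = \frac{1 - \theta_1 - \theta_2 - \theta_3}{\sum_{j=4}^{7500} j^{-\theta_4}},
\tag{16}
$$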

with $\theta_0$ being a permutation of the integers 1, 2, ..., 7500, and $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$ being nonnegative
real numbers; we estimate $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$ via maximum-likelihood methods, determining
$\theta_0$ by sorting as discussed above, and setting $\theta_1$, $\theta_2$, and $\theta_3$ to be the three greatest relative
frequencies. This model fits the empirical data exactly in the bins whose probabilities under
the model are $\theta_1$, $\theta_2$, and $\theta_3$ (there will be no discrepancy between the data and the model
in those bins), so that these bins do not contribute to any goodness-of-fit statistic, aside
from altering the number of draws in the remaining bins. Of the 20,000 total draws in
the given experimental data, 16,486 do not fall in the bins associated with the three most
frequently occurring words. The maximum-likelihood estimate of the power-law exponent
$\theta_4$ for the experimental data turns out to be about 1.0484.
Figure 3: Numbers of occurrences of the various words (one bin for each distinct word) in a
corpus of 10,000 random draws from List 1 of Eldridge (2010)



For the model defined in (15) and (16), the P-values calculated via 4,000,000 Monte-Carlo
simulations are

• $\chi^2$: .510

• $G^2$: .998

• Freeman-Tukey: 1.000

• root-mean-square: .587

Thus, all four statistics indicate that the truncated power-law model defined in (15) and (16)
is a good fit. This is in accord with Figure 5, in which all but the three greatest frequencies
appear to follow a truncated power-law.

4.3 A Poisson law for radioactive decays 

Table 1 summarizes the classic example of a Poisson-distributed experiment in radioactive
decay of Rutherford et al. (1910); Figure 7 plots the data, along with the Poisson distribution
whose mean is the same as the data's. Figure 8 reports the P-values for testing whether
the data, while retaining only bins 1, 2, ..., m, are distributed according to a Poisson
distribution (the model Poisson distribution is also truncated to the first m bins, with the
mean estimated from the data). Since the total number n of draws depends little on the
numbers in bins 13, 14, 15, ..., the truncation amounts to ignoring draws in bins m+1, m+2,
m+3, ... when m > 12, and demonstrates that the scant experimental draws in bins 13-15
strongly influence the P-values of the classical statistics. We computed the P-values via
40,000 Monte-Carlo simulations (for each number m of bins and each of the four statistics),
estimating the mean of the model Poisson distribution for each simulated data set. All four
goodness-of-fit statistics indicate reasonably good agreement between the data and a Poisson
distribution; the classical statistics are very sensitive in the tail to discrepancies between the
data and the model distribution, whereas the root-mean-square is relatively insensitive to
the truncation after 12 or more bins.

Figure 4: P-values for the data plotted in Figure 3 to follow the Zipf distribution

Figure 5: Numbers of occurrences of the various words (one bin for each distinct word) in a
corpus of 20,000 random draws from List 5 of Eldridge (2010)

Figure 6: P-values for the data plotted in Figure 5 to follow the Zipf distribution



4.4 A Poisson law for counting with a hemacytometer 



Page 357 of Student (1907) reports on the number of yeast cells observed in each of 400
squares in a hemacytometer microscope slide. Table 2 displays the counts; Figure 9 plots
them, along with the Poisson distribution whose mean matches the data's. The P-values for
the data to arise from a Poisson distribution (with the mean estimated from the data) are

• $\chi^2$: .627

• $G^2$: .365

• Freeman-Tukey: .111

• root-mean-square: .490

We calculated the P-values via 4,000,000 Monte-Carlo simulations, estimating the mean of
the model Poisson distribution for each simulated data set. Evidently, all four statistics
report that a Poisson distribution is a reasonably good model for the experimental data.
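The Monte-Carlo procedure with re-estimation of the mean for each simulated data set can be sketched as follows for the root-mean-square. This is an illustration under stated assumptions rather than the implementation behind the numbers above: all function names are hypothetical, bins are taken to be the counts 0, 1, ..., m - 1, and the sample mean (the maximum-likelihood estimate for an untruncated Poisson distribution) is used as the estimate of the mean even though the model is truncated.

```python
import numpy as np
from math import lgamma

def truncated_poisson(mean, num_bins):
    """Poisson probabilities for counts 0, 1, ..., num_bins - 1, renormalized to sum to 1."""
    k = np.arange(num_bins)
    log_fact = np.array([lgamma(i + 1.0) for i in k])
    weights = np.exp(k * np.log(mean) - log_fact)
    return weights / weights.sum()

def root_mean_square(counts, model):
    return np.sqrt(np.mean((counts / counts.sum() - model) ** 2))

def fit_and_score(counts):
    """Estimate the mean from the counts, then score the fitted model."""
    k = np.arange(len(counts))
    mean = max((k * counts).sum() / counts.sum(), 1e-12)  # guard against log(0)
    model = truncated_poisson(mean, len(counts))
    return root_mean_square(counts, model), model

def monte_carlo_p_value(counts, num_sims=2000, seed=0):
    """Fraction of simulated data sets whose statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n = int(counts.sum())
    observed, fitted = fit_and_score(counts)
    exceed = 0
    for _ in range(num_sims):
        simulated = rng.multinomial(n, fitted).astype(float)
        stat, _ = fit_and_score(simulated)  # the mean is re-estimated here
        exceed += stat >= observed
    return exceed / num_sims
```

Re-estimating the mean inside the loop is what accounts for the parameter having been fitted to the data; scoring every simulated data set against the single fitted model would instead test a fully specified distribution.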
Table 1: Numbers of α-particles emitted by a film of polonium in 2608 intervals of 7.5 seconds

Figure 7: The data in Table 1 (the dots) and the best-fit Poisson distribution (the lines)

Figure 8: P-values for the distribution of Table 1 to be Poisson
4.5 A Hardy-Weinberg law for Rhesus blood groups

In a population with suitably random mating, the proportions of pairs of Rhesus haplotypes
in members of the population (each member has one pair) can be expected to follow the
Hardy-Weinberg law discussed by Guo and Thompson (1992), namely to arise via random
sampling from the model



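$$
p_0^{(j,k)}(\theta_1, \theta_2, \dots, \theta_9) =
\begin{cases}
2 \cdot \theta_j \cdot \theta_k, & j > k \\
(\theta_k)^2, & j = k
\end{cases}
\tag{17}
$$

for j, k = 1, 2, ..., 9 with $j \ge k$, under the constraint that

$$
\sum_{j=1}^{9} \theta_j = 1,
\tag{18}
$$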

where the parameters $\theta_1$, $\theta_2$, ..., $\theta_9$ are the proportions of the nine Rhesus haplotypes in
the population (naturally, their maximum-likelihood estimates are the proportions of the
haplotypes in the given data). For j, k = 1, 2, ..., 9 with $j \ge k$, therefore, $p_0^{(j,k)}$ is the
expected probability that the pair of haplotypes in the genome of an individual is the pair
j and k, given the parameters $\theta_1$, $\theta_2$, ..., $\theta_9$.
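A minimal sketch of fitting this model to a table of counts follows. The function name `hardy_weinberg_model` and the lower-triangular array layout are assumptions made for illustration; the estimate of each haplotype proportion is simply the fraction of all 2n haplotype copies in the sample that are of that type.

```python
import numpy as np

def hardy_weinberg_model(pair_counts):
    """Fit the Hardy-Weinberg model to counts of haplotype pairs.

    pair_counts[j, k] for j >= k is the number of individuals carrying the
    pair (j, k); entries above the diagonal are ignored.
    Returns (theta, model), where model[j, k] for j >= k is 2 * theta[j] * theta[k]
    when j > k and theta[k] ** 2 when j == k.
    """
    c = np.tril(np.asarray(pair_counts, dtype=float))
    n = c.sum()
    # Each individual carries two haplotypes; a pair (j, j) contributes two copies
    # of j because the diagonal is counted in both the row sum and the column sum.
    copies = c.sum(axis=0) + c.sum(axis=1)
    theta = copies / (2 * n)
    model = 2 * np.outer(theta, theta)
    np.fill_diagonal(model, theta ** 2)
    return theta, np.tril(model)

# Hypothetical counts for two haplotypes: 10 individuals with pair (1, 1),
# 20 with pair (2, 1), and 10 with pair (2, 2).
theta, model = hardy_weinberg_model([[10, 0], [20, 10]])
```

The fitted probabilities over the pairs with $j \ge k$ sum to 1, since they expand $(\theta_1 + \theta_2 + \dots)^2$.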

In this formulation, the hypothesis of suitably random mating entails that the members
of the sample population are i.i.d. draws from the model specified in (17); if a goodness-of-fit
statistic rejects the model with high confidence, then we can be confident that mating has
not been suitably random. Table 3 provides data on n = 8297 individuals; we duplicated
Figure 3 of Guo and Thompson (1992) to obtain Table 3.

Table 3: Frequencies of pairs of Rhesus haplotypes

The P-values calculated via 4,000,000 Monte-Carlo simulations are

• $\chi^2$: .693

• $G^2$: .600

• Freeman-Tukey: .562

• negative log-likelihood (see Remark 4.2 below): .649

• root-mean-square: .039

Unlike the root-mean-square, the classical statistics are blind to the significant discrepancy
between the data and the Hardy-Weinberg model. Please note that the P-values associated
with the classical statistics are over an order of magnitude larger than the P-value associated
with the root-mean-square.

Remark 4.1. For the example of the present subsection, rejecting the null hypothesis (5)
from Section 3 might seem in principle to be more interesting than rejecting the assumption (7).
Fortunately, the difference between (5) and (7) is essentially irrelevant for the
root-mean-square in this example. Indeed, the root-mean-square is not very sensitive to bins
associated with the parameters whose estimated values are potentially inaccurate: the
potentially inaccurate estimates are all small, and the root-mean-square is not very sensitive
to bins whose probabilities under the model are small relative to others.

Remark 4.2. The term "negative log-likelihood" used in the present section refers to the
statistic that is simply the negative of the logarithm of the likelihood. The negative
log-likelihood is the same statistic used in the generalization of Fisher's exact test discussed
by Guo and Thompson (1992); unlike $G^2$, this statistic involves only one likelihood, not the
ratio of two. We mention the negative log-likelihood just to facilitate comparisons; we are
not asserting that the likelihood on its own (rather than in a ratio) is a good gauge of the
relative sizes of deviations from a model.
Table 4: Frequencies of genotypes

Remark 4.3. Table 4 provides data on n = 45 individuals from the other set of real-world
measurements given by Guo and Thompson (1992); we duplicated Figure 2 of Guo and Thompson
(1992) to obtain Table 4. The associated Hardy-Weinberg model is then the same as (17),
but with only four parameters, $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, such that $\sum_{j=1}^{4} \theta_j = 1$. The P-values calculated
via 4,000,000 Monte-Carlo simulations are

• $\chi^2$: .021

• $G^2$: .013

• Freeman-Tukey: .027

• negative log-likelihood (see Remark 4.2 above): .016

• root-mean-square: .0019

Again the root-mean-square is more powerful than the classical statistics.



4.6 Symmetry between the self-reported health assessments of 
foreign- and US-born Asian Americans 



Using propensity scores, Erosheva et al. (2007) matched each of 335 surveyed foreign-born
Asian Americans to a similar surveyed US-born Asian American. Table 5 duplicates Table 4
of Erosheva et al. (2007), tabulating the numbers of matched pairs reporting various
combinations of physical health; the propensity scores were generated without reference to the
health ratings. Table 5 does not reveal any significant difference between foreign-born Asian
Americans' ratings of their health and US-born Asian Americans'. Indeed, the P-values
calculated via 4,000,000 Monte-Carlo simulations for testing the symmetry of Table 5 are

• $\chi^2$: .784

• $G^2$: .739

• Freeman-Tukey: .642

• root-mean-square: .973
After noting that $\chi^2$ does not reveal any statistically significant asymmetry in Table 5,
Erosheva et al. (2007) reported that, "to address the issue of power of this test, we
investigated what is the smallest departure from symmetry that our test could detect. . . ." Such
an investigation requires considering modifications to Table 5. Table 6 provides one possible
modification. The P-values calculated via 4,000,000 Monte-Carlo simulations for testing the
symmetry of Table 6 are

• $\chi^2$: .109

• $G^2$: .123

• Freeman-Tukey: .155

• root-mean-square: .014

Evidently, the root-mean-square is more powerful for detecting the asymmetry of Table 6.

Table 7 provides another hypothetical cross-tabulation. The P-values calculated via
64,000,000 Monte-Carlo simulations for testing the symmetry of Table 7 are

• $\chi^2$: .0015

• $G^2$: .00016

• Freeman-Tukey: .000006, i.e., 6E-6

• root-mean-square: .131

The classical statistics are much more powerful for detecting the asymmetry of Table 7,
contrasting how the root-mean-square is more powerful for detecting the asymmetry of Table 6.
Indeed, the root-mean-square statistic is not very sensitive to relative discrepancies between
the model and actual distributions in bins whose associated model probabilities are small.
When sensitivity in these bins is desirable, we recommend using both the root-mean-square
statistic and an asymptotically equivalent variation of $\chi^2$ such as the log-likelihood-ratio $G^2$.
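As a rough sketch of what testing symmetry involves, the fit of a symmetric model to a square cross-tabulation can be computed as below. The function names are hypothetical, and we assume here that the model of symmetry is the multinomial distribution over cells with equal probabilities for cells (j, k) and (k, j), whose maximum-likelihood fit is the average of the normalized table and its transpose; the tables of the present subsection are not reproduced, so the example counts are made up.

```python
import numpy as np

def symmetric_fit(table):
    """Cell probabilities fitted to a square table of counts under symmetry."""
    t = np.asarray(table, dtype=float)
    return (t + t.T) / (2.0 * t.sum())

def rms_asymmetry(table):
    """Root-mean-square difference between the empirical cells and the symmetric fit."""
    t = np.asarray(table, dtype=float)
    return np.sqrt(np.mean((t / t.sum() - symmetric_fit(t)) ** 2))

# Hypothetical 2 x 2 cross-tabulation of matched pairs.
value = rms_asymmetry([[10, 4], [8, 6]])
```

Only the off-diagonal cells contribute, and each contributes its absolute discrepancy, so an asymmetry concentrated in heavily populated cells dominates the statistic.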

4.7 A modified geometric law for the species of butterflies 



Fisher et al. (1943) reported on 5300 butterflies from 217 readily identified species (these
exclude the 23 most common readily identified species) that they collected via random
sampling at the Rothamsted Experimental Station in England. Figure 10 plots the numbers of
individual butterflies collected from the 217 species when sorted in rank order (so that the
numbers are nonincreasing).

To build a model appropriate for Figure 10, we must include a permutation of the bins
as a parameter, since we have sorted the data (see Subsection 4.2 for further discussion of
sorting and permutations). We take the model to be

$$
p_0^{(j)}(\theta_0, \theta_1) = A_{\theta_1} \frac{\theta_1^{\theta_0(j)}}{\sqrt{\theta_0(j) + 23}}
\tag{19}
$$

for j = 1, 2, ..., 217, where $\theta_0$ is a permutation of the integers 1, 2, ..., 217, the parameter
$\theta_1$ is a positive real number less than 1, and

$$
A_{\theta_1} = \frac{1}{\sum_{j=1}^{217} \theta_1^j / \sqrt{j + 23}};
\tag{20}
$$

we estimate $\theta_0$ and $\theta_1$ via maximum-likelihood methods (thus obtaining $\theta_0$ by sorting the
frequencies into nonincreasing order). Please note that this model is not very carefully chosen:
the model is just a truncated geometric distribution weighted by the nonsingular function
$1/\sqrt{\theta_0(j) + 23}$, with 23 being the number of common species omitted from the collection.
More complicated models may fit better.

Table 5: Self-reported physical health for matched pairs of Asian Americans

Table 6: A variation on Table 5

Table 7: Another variation on Table 5

The P-values calculated via 4,000,000 Monte-Carlo simulations are

• $\chi^2$: .0050

• $G^2$: .349

• Freeman-Tukey: .951

• root-mean-square: .00002, i.e., 2E-5

As Figure 10 indicates, the discrepancy between the empirical data and the model is
substantial, and, given the large number of draws (5300), cannot be due solely to random
fluctuations. The log-likelihood-ratio ($G^2$) and Freeman-Tukey statistics are unable to
detect this discrepancy, while the root-mean-square easily determines that the discrepancy is
very highly significant.



5 The power and efficiency of the root-mean-square

In this section, we consider many numerical experiments and models, plotting the numbers 
of draws required for goodness-of-fit statistics to detect divergence from the models. We 
consider both fully specified models and parameterized models. To quantify a statistic's 
success at detecting discrepancies from the models, we use the formulation of the following 
remark. 

Remark 5.1. We say that a statistic based on given i.i.d. draws "distinguishes" the actual
underlying distribution of the draws from the model distribution to mean that the computed
P-value is at most 1% for 99% of 40,000 simulations, with each simulation generating n i.i.d.
draws according to the actual distribution. We computed the P-values by conducting another
40,000 simulations, with each simulation generating n i.i.d. draws according to the model
distribution. Appendix A of Perkins et al. (2011a) uses a weaker notion of "distinguish":
in Appendix A we say that a statistic based on given i.i.d. draws "distinguishes" the
actual underlying distribution of the draws from the model distribution to mean that the
computed P-value is at most 5% for 95% of 40,000 simulations, while running simulations
and computing P-values exactly as for the plots in the present section.
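The criterion of Remark 5.1 can be sketched for a fully specified model as follows. This is an illustrative implementation under assumptions, not the code used for the plots: the names `rms_stat` and `distinguishes` are hypothetical, only the root-mean-square is shown, and the default number of simulations is far smaller than 40,000 to keep the example fast.

```python
import numpy as np

def rms_stat(counts, model):
    """Root-mean-square statistic for each row of a (simulations x bins) array of counts."""
    q = counts / counts.sum(axis=-1, keepdims=True)
    return np.sqrt(np.mean((q - model) ** 2, axis=-1))

def distinguishes(model, actual, n, num_sims=2000, level=0.01, power=0.99, seed=0):
    """True if the P-value is at most `level` in at least a fraction `power` of simulations."""
    rng = np.random.default_rng(seed)
    model = np.asarray(model, dtype=float)
    # Reference distribution of the statistic: draws from the model.
    null_stats = np.sort(rms_stat(rng.multinomial(n, model, size=num_sims), model))
    # Statistics for data sets drawn from the actual distribution.
    alt_stats = rms_stat(rng.multinomial(n, actual, size=num_sims), model)
    # P-value: fraction of reference statistics at least as large as the observed one.
    below = np.searchsorted(null_stats, alt_stats, side="left")
    p_values = (num_sims - below) / num_sims
    return bool(np.mean(p_values <= level) >= power)
```

The number of draws required to distinguish two distributions, as plotted in the figures of the present section, is then the smallest n for which `distinguishes` returns `True`; a simple search over increasing n suffices for a sketch like this one.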
Figure 10: Numbers of specimens (the dots) from 217 species of butterflies (one bin per
species), and the best-fit distribution (the lines)



Remark 5.2. To compute the P-values for each example in Subsection 5.2, we should in
principle calculate the maximum-likelihood estimate $\hat\theta$ for each of 40,000 simulations and (for
each goodness-of-fit statistic) use these estimates to perform $(40{,}000)^2$ times the three-step
procedure described in Remark 3.3. The computational costs for generating the plots in
Subsection 5.2 would then be excessive. Instead, when computing the P-values as a function
of the value of the statistic under consideration, we calculated $\hat\theta$ only once, using as the
empirical data 1,000,000 draws from the underlying distribution, and (for each
goodness-of-fit statistic) performed 40,000 times the three-step procedure described in Remark 3.3,
using the single value of $\hat\theta$ (but many values of $\tilde\theta$ from Remark 3.3). The parameter
estimates did not vary much over the 40,000 simulations, so approximating the P-values thus
is accurate. Furthermore, when the parameter is just a permutation, as in Subsection 5.2.6,
the "approximation" described in the present remark is exactly equivalent to recomputing
the P-values 40,000 times; we are not making any approximation at all. Please note that
we did recalculate the maximum-likelihood estimate $\hat\theta$ (and $\tilde\theta$ from Remark 3.3) for each of
40,000 simulations when computing the values of the statistics for the simulation; however,
when calculating the P-values as a function of the values of the statistics, we always drew
from the model distribution associated with the same value of the parameter.

Remark 5.3. The root-mean-square statistic is not very sensitive to relative discrepancies
between the model and actual distributions in bins whose associated model probabilities are
small. When sensitivity in these bins is desirable, we recommend using both the
root-mean-square statistic and an asymptotically equivalent variation of $\chi^2$, such as the
log-likelihood-ratio or "$G^2$" statistic.
5.1 Examples without parameter estimation

5.1.1 A simple, illustrative example

Let us first specify the model distribution to be 

A" = j, (21) 
Pf = j, (22) 

and 

for j = 3, 4, . . . , m. We consider n i.i.d. draws from the distribution 

P (l) = I (24) 



P {2) = ^ (25) 



and

$$
p^{(j)} = p_0^{(j)}
\tag{26}
$$

for j = 3, 4, ..., m, where $p_0^{(3)}$, $p_0^{(4)}$, ..., $p_0^{(m)}$ are the same as in (23).

Figure 11 plots the percentage of 40,000 simulations, each generating 200 i.i.d. draws
according to the actual distribution defined in (24)-(26), that are successfully detected as
not arising from the model distribution at the 1% significance level. We computed the
P-values by conducting 40,000 simulations, each generating 200 i.i.d. draws according to
the model distribution defined in (21)-(23). Figure 11 shows that the root-mean-square
is successful in at least 99% of the simulations, while the classical $\chi^2$ statistic fails often,
succeeding in less than 80% of the simulations for m = 16, and less than 5% for m > 256.

Figure 12 plots the number n of draws required to distinguish the actual distribution
defined in (24)-(26) from the model distribution defined in (21)-(23). Remark 5.1 above
specifies what we mean by "distinguish." Figure 12 shows that the root-mean-square requires
only about n = 185 draws for any number m of bins, while the classical $\chi^2$ statistic requires
90% more draws for m = 16, and greater than 300% more for m > 128. Furthermore, the
classical $\chi^2$ statistic requires increasingly many draws as the number m of bins increases,
unlike the root-mean-square.
Figure 11: First example, with n = 200 draws; see Subsection 5.1.1

Figure 13: Second example; see Subsection 5.1.2



5.1.2 Truncated power-laws 

Next, let us specify the model distribution to be

$$
p_0^{(j)} = \frac{C_1}{j}
\tag{27}
$$

for j = 1, 2, ..., m, where

$$
C_1 = \frac{1}{\sum_{j=1}^{m} 1/j}.
\tag{28}
$$

We consider n i.i.d. draws from the distribution

$$
p^{(j)} = \frac{C_2}{j^2}
\tag{29}
$$

for j = 1, 2, ..., m, where

$$
C_2 = \frac{1}{\sum_{j=1}^{m} 1/j^2}.
\tag{30}
$$

Figure 13 plots the number n of draws required to distinguish the actual distribution
defined in (29) and (30) from the model distribution defined in (27) and (28). Remark 5.1
above specifies what we mean by "distinguish." Figure 13 shows that the classical $\chi^2$ statistic
requires increasingly many draws as the number m of bins increases, while the
root-mean-square exhibits the opposite behavior.
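The truncated power-laws used in this and the following subsections are straightforward to tabulate. The helper below is a hypothetical sketch, assuming that the model of the present subsection has exponent 1 and the actual distribution has exponent 2; the normalizing constant is whatever makes the m probabilities sum to 1.

```python
import numpy as np

def truncated_power_law(m, exponent):
    """Probabilities proportional to j ** (-exponent) for bins j = 1, 2, ..., m."""
    j = np.arange(1, m + 1)
    weights = j ** (-float(exponent))
    return weights / weights.sum()

# Hypothetical illustration with m = 16 bins.
model = truncated_power_law(16, 1)   # proportional to 1 / j
actual = truncated_power_law(16, 2)  # proportional to 1 / j ** 2
```

As m grows, the added bins have ever smaller probabilities under both distributions, which is the regime in which the behaviors of the classical statistics and the root-mean-square diverge.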
5.1.3 Additional truncated power-laws

Let us again specify the model distribution to be

$$
p_0^{(j)} = \frac{C_1}{j}
\tag{31}
$$

for j = 1, 2, ..., m, where

$$
C_1 = \frac{1}{\sum_{j=1}^{m} 1/j}.
\tag{32}
$$

We now consider n i.i.d. draws from the distribution

$$
p^{(j)} = \frac{C_{1/2}}{\sqrt{j}}
\tag{33}
$$

for j = 1, 2, ..., m, where

$$
C_{1/2} = \frac{1}{\sum_{j=1}^{m} 1/\sqrt{j}}.
\tag{34}
$$

Figure 14 plots the number n of draws required to distinguish the actual distribution
defined in (33) and (34) from the model distribution defined in (31) and (32). Remark 5.1
above specifies what we mean by "distinguish." The root-mean-square is not uniformly more
powerful than the other statistics in this example; see Remark 5.3 at the beginning of the
present section.
Figure 15: Fourth example; see Subsection 5.1.4

5.1.4 Additional truncated power-laws, reversed

Let us next specify the model distribution to be

$$
p_0^{(j)} = \frac{C_{1/2}}{\sqrt{j}}
\tag{35}
$$

for j = 1, 2, ..., m, where

$$
C_{1/2} = \frac{1}{\sum_{j=1}^{m} 1/\sqrt{j}}.
\tag{36}
$$

We now consider n i.i.d. draws from the distribution

$$
p^{(j)} = \frac{C_1}{j}
\tag{37}
$$

for j = 1, 2, ..., m, where

$$
C_1 = \frac{1}{\sum_{j=1}^{m} 1/j}.
\tag{38}
$$

Figure 15 plots the number n of draws required to distinguish the actual distribution
defined in (37) and (38) from the model distribution defined in (35) and (36). Remark 5.1
above specifies what we mean by "distinguish." Figure 15 shows that the classical $\chi^2$ statistic
requires many times more draws than the root-mean-square, as the number m of bins
increases.
Figure 16: Fifth example; see Subsection 5.1.5

5.1.5 A final example with fully specified truncated power-laws

Let us next specify the model distribution to be

$$
p_0^{(j)} = \frac{C_2}{j^2}
\tag{39}
$$

for j = 1, 2, ..., m, where

$$
C_2 = \frac{1}{\sum_{j=1}^{m} 1/j^2}.
\tag{40}
$$

We again consider n i.i.d. draws from the distribution

$$
p^{(j)} = \frac{C_1}{j}
\tag{41}
$$

for j = 1, 2, ..., m, where

$$
C_1 = \frac{1}{\sum_{j=1}^{m} 1/j}.
\tag{42}
$$

Figure 16 plots the number n of draws required to distinguish the actual distribution
defined in (41) and (42) from the model distribution defined in (39) and (40). Remark 5.1
above specifies what we mean by "distinguish." The root-mean-square is not uniformly more
powerful than the other statistics in this example; see Remark 5.3 at the beginning of the
present section.
Figure 17: Sixth example; see Subsection 5.1.6

5.1.6 Modified Poisson distributions

Let us specify the model distribution to be the (truncated) Poisson distribution 

(J) _ ^3m/8 { — ) , , 

for j = 1, 2, . . . , m, where 



We consider n i.i.d. draws from the distribution 

$$
p^{((3m/8)-1)} = S/10,
\tag{45}
$$

$$
p^{(3m/8)} = 4S/5,
\tag{46}
$$

$$
p^{((3m/8)+1)} = S/10,
\tag{47}
$$

$$
S = p_0^{((3m/8)-1)} + p_0^{(3m/8)} + p_0^{((3m/8)+1)},
\tag{48}
$$

and

$$
p^{(j)} = p_0^{(j)}
\tag{49}
$$

for the remaining values of j (for j = 1, 2, ..., 3m/8 - 2 and j = 3m/8 + 2, 3m/8 + 3, ..., m).

Figure 17 plots the number n of draws required to distinguish the actual distribution
defined in (45)-(49) from the model distribution defined in (43) and (44). Remark 5.1 above
specifies what we mean by "distinguish."
Figure 18: Seventh example; see Subsection 5.1.7

5.1.7 A truncated power-law and a truncated geometric distribution

Let us finally specify the model distribution to be

$$
p_0^{(j)} = \frac{C_1}{j}
\tag{50}
$$

for j = 1, 2, ..., 100, where

$$
C_1 = \frac{1}{\sum_{j=1}^{100} 1/j}.
\tag{51}
$$

We consider n i.i.d. draws from the (truncated) geometric distribution

$$
p^{(j)} = c_t \, t^j
\tag{52}
$$

for j = 1, 2, ..., 100, where

$$
c_t = \frac{1}{\sum_{j=1}^{100} t^j}.
\tag{53}
$$

Figure 18 considers several values for t.

Figure 18 plots the number n of draws required to distinguish the actual distribution
defined in (52) and (53) from the model distribution defined in (50) and (51). Remark 5.1
above specifies what we mean by "distinguish." See the next section, Subsection 5.2.1, for a
similar example, this time involving parameter estimation.
Figure 19: First example; see Subsection 5.2.1

5.2 Examples with parameter estimation

5.2.1 A truncated power-law and a truncated geometric distribution

We turn now to models involving parameter estimation, as detailed by Perkins et al. (2011b).
Let us specify the model distribution to be the Zipf distribution

$$
p_0^{(j)}(\theta) = \frac{C_\theta}{j^\theta}
\tag{54}
$$

for j = 1, 2, ..., 100, where

$$
C_\theta = \frac{1}{\sum_{j=1}^{100} 1/j^\theta};
\tag{55}
$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider n i.i.d. draws
from the (truncated) geometric distribution

$$
p^{(j)} = c_t \, t^j
\tag{56}
$$

for j = 1, 2, ..., 100, where

$$
c_t = \frac{1}{\sum_{j=1}^{100} t^j}.
\tag{57}
$$

Figure 19 considers several values for t.

Figure 19 plots the number n of draws required to distinguish the actual distribution
defined in (56) and (57) from the model distribution defined in (54) and (55), estimating the
parameter $\theta$ in (54) and (55) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."

Figure 20: Second example; see Subsection 5.2.2



5.2.2 A rebinned geometric distribution and a truncated power-law 

Let us specify the model distribution to be

$$
p_0^{(j)}(\theta) = \theta^{j-1} (1 - \theta)
\tag{58}
$$

for j = 1, 2, ..., 99, and

$$
p_0^{(100)}(\theta) = \theta^{99};
\tag{59}
$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider n i.i.d. draws
from the Zipf distribution

$$
p^{(j)} = \frac{C_t}{j^t}
\tag{60}
$$

for j = 1, 2, ..., 100, where

$$
C_t = \frac{1}{\sum_{j=1}^{100} 1/j^t}.
\tag{61}
$$

Figure 20 considers several values for t.

Figure 20 plots the number n of draws required to distinguish the actual distribution
defined in (60) and (61) from the model distribution defined in (58) and (59), estimating the
parameter $\theta$ in (58) and (59) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
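For a rebinned geometric model of this kind (a geometric distribution on bins 1 through 99, with all remaining mass collected in bin 100), the maximum-likelihood estimate has a closed form: if $n_j$ denotes the number of draws in bin j, the log-likelihood is $a \log\theta + b \log(1 - \theta)$ with $a = \sum_{j=1}^{100} (j - 1) n_j$ and $b = \sum_{j=1}^{99} n_j$, which is maximized at $\theta = a/(a + b)$. The sketch below assumes this form of the model; the function names are hypothetical and the code is illustrative rather than the implementation used for the figures.

```python
import numpy as np

def rebinned_geometric(theta, num_bins=100):
    """Geometric probabilities on bins 1..num_bins-1, with the tail mass in the last bin."""
    j = np.arange(1, num_bins)
    head = theta ** (j - 1) * (1 - theta)
    tail = theta ** (num_bins - 1)  # total mass at or beyond bin num_bins
    return np.append(head, tail)

def mle_theta(counts):
    """Closed-form maximum-likelihood estimate of theta from per-bin counts."""
    counts = np.asarray(counts, dtype=float)
    j = np.arange(1, len(counts) + 1)
    a = np.sum((j - 1) * counts)  # total exponent of theta in the likelihood
    b = np.sum(counts[:-1])       # total exponent of (1 - theta)
    return a / (a + b)
```

Feeding `mle_theta` the expected counts under a given value of the parameter returns that value, which is a convenient sanity check on the formula.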
Figure 21: Third example; see Subsection 5.2.3

5.2.3 Truncated shifted Poisson distributions

Let us specify the model distribution to be the (truncated) Poisson distribution

$$
p_0^{(j)}(\theta) = B_\theta \frac{\theta^{j-1}}{(j-1)!}
\tag{62}
$$

for j = 1, 2, ..., 21, where

$$
B_\theta = \frac{1}{\sum_{j=1}^{21} \theta^{j-1}/(j-1)!};
\tag{63}
$$

we estimate the parameter $\theta$ via maximum-likelihood methods. We consider n i.i.d. draws
from the distribution

$$
p^{(j)} = \tilde{B}_t \frac{5^{j-1+t}}{(j-1+t)!}
\tag{64}
$$

for j = 1, 2, ..., 21, where

$$
\tilde{B}_t = \frac{1}{\sum_{j=1}^{21} 5^{j-1+t}/(j-1+t)!}.
\tag{65}
$$

Figure 21 considers several values for t. Clearly, $p^{(j)} = p_0^{(j)}(5)$ for j = 1, 2, ..., 21, if t = 0.

Figure 21 plots the number n of draws required to distinguish the actual distribution
defined in (64) and (65) from the model distribution defined in (62) and (63), estimating the
parameter $\theta$ in (62) and (63) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
Figure 22: Fourth example; see Subsection 5.2.4

5.2.4 An example with a uniform tail

Let us specify the model distribution to be 



1 

2 



0. 

o, 
1 



20, 



(66) 
(67) 
(68) 

(69) 



2m — 6 

for j = 4, 5, ..., m; we estimate the parameter $\theta$ via maximum-likelihood methods. We
consider n i.i.d. draws from the distribution

$$
p^{(1)} = \frac{1}{4},
\tag{70}
$$

$$
p^{(2)} = \frac{1}{8},
\tag{71}
$$

$$
p^{(3)} = \frac{1}{8},
\tag{72}
$$

and

$$
p^{(j)} = \frac{1}{2m - 6}
\tag{73}
$$

for j = 4, 5, ..., m.

Figure 22 plots the number n of draws required to distinguish the actual distribution
defined in (70)-(73) from the model distribution defined in (66)-(69), estimating the
parameter $\theta$ in (66)-(69) via maximum-likelihood methods. Remark 5.1 above specifies what we
mean by "distinguish."
Figure 23: Fifth example; see Subsection 5.2.5



5.2.5 A model with an integer-valued parameter 

Let us specify the model distribution to be 



for j = 1, 2, ..., $\theta$, and



= (74) 
#® = ^te) (75) 

for j = $\theta$ + 1, $\theta$ + 2, ..., m; we estimate the parameter $\theta$ via maximum-likelihood methods.
We consider n i.i.d. draws from the distribution

pW = i (76) 

P {2) = I (77) 
P (3) = |, (78) 

and 

^ = i^Ti < 79 > 

for j = 4, 5, . . . , m. 

Figure 23 plots the number n of draws required to distinguish the actual distribution
defined in (76)-(79) from the model distribution defined in (74) and (75), estimating the
parameter $\theta$ in (74) and (75) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."
Figure 24: Sixth example; see Subsection 5.2.6

5.2.6 Truncated power-laws parameterized with a permutation

Let us specify the model to be the Zipf distribution

$$
p_0^{(j)}(\theta) = \frac{C_1}{\theta(j)}
\tag{80}
$$

for j = 1, 2, ..., m, where $\theta$ is a permutation of the integers 1, 2, ..., m, and

$$
C_1 = \frac{1}{\sum_{j=1}^{m} 1/j};
\tag{81}
$$

we estimate the permutation $\theta$ via maximum-likelihood methods, that is, by sorting the
frequencies: first we choose $j_1$ to be the number of a bin containing the greatest number of
draws among all m bins, then we choose $j_2$ to be the number of a bin containing the greatest
number of draws among the remaining m - 1 bins, then we choose $j_3$ to be the number of a
bin containing the greatest number of draws among the remaining m - 2 bins, and so on, and
finally we find $\theta$ such that $\theta(j_1) = 1$, $\theta(j_2) = 2$, ..., $\theta(j_m) = m$.
We consider n i.i.d. draws from the distribution

$$
p^{(j)} = \frac{C_2}{j^2}
\tag{82}
$$

for j = 1, 2, ..., m, where

$$
C_2 = \frac{1}{\sum_{j=1}^{m} 1/j^2}.
\tag{83}
$$

Figure 24 plots the number n of draws required to distinguish the actual distribution
defined in (82) and (83) from the model distribution defined in (80) and (81), estimating
the parameter $\theta$ in (80) via maximum-likelihood methods (that is, by sorting). Remark 5.1
above specifies what we mean by "distinguish."
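Estimating a permutation by sorting, as in the present subsection, can be sketched in a few lines. The function names are hypothetical, bins are indexed from 0 in the code (so `theta[j]` is the rank of bin j + 1 in the notation of the text), ties are broken by bin index, and the fitted model is assumed to assign probability proportional to the reciprocal of the rank.

```python
import numpy as np

def estimate_permutation(counts):
    """Return theta, where theta[j] is the rank of bin j (1 = greatest number of draws)."""
    counts = np.asarray(counts)
    order = np.argsort(-counts, kind="stable")  # order[r] is the bin holding rank r + 1
    theta = np.empty(len(counts), dtype=int)
    theta[order] = np.arange(1, len(counts) + 1)
    return theta

def zipf_model(theta):
    """Probabilities proportional to 1 / theta[j], normalized to sum to 1."""
    weights = 1.0 / theta
    return weights / weights.sum()

# Hypothetical counts over four bins.
theta = estimate_permutation([5, 20, 1, 7])
model = zipf_model(theta)
```

Because the estimated permutation depends on the data only through the ordering of the counts, the fitted model always places its largest probability on the most populated bin.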
Figure 25: Seventh example; see Subsection 5.2.7



5.2.7 A model with two parameters 

For the final example, let us specify the model distribution to be 

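$$
p_0^{(1)}(\theta_1, \theta_2) = \theta_1,
\tag{84}
$$

$$
p_0^{(2)}(\theta_1, \theta_2) = \theta_1,
\tag{85}
$$

$$
p_0^{(3)}(\theta_1, \theta_2) = \theta_2,
\tag{86}
$$

$$
p_0^{(4)}(\theta_1, \theta_2) = \theta_2,
\tag{87}
$$

and

$$
p_0^{(j)}(\theta_1, \theta_2) = \frac{1 - 2\theta_1 - 2\theta_2}{m - 4}
\tag{88}
$$

for j = 5, 6, ..., m; we estimate the parameters $\theta_1$ and $\theta_2$ via maximum-likelihood methods.
We consider n i.i.d. draws from the distribution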

= ^, (89) 

P (2) = |, (90) 

P (3) = |, (91) 

^ - h (92) 
and

for j = 5, 6, ..., m.

Figure 25 plots the number n of draws required to distinguish the actual distribution
defined in (89)-(93) from the model distribution defined in (84)-(88), estimating the
parameters $\theta_1$ and $\theta_2$ in (84)-(88) via maximum-likelihood methods. Remark 5.1 above specifies
what we mean by "distinguish."